\chapter{Algorithms}\label{chap:Algorithms}

Mavis' algorithm of color computation consists of four stages. First, multidimensional scaling converts the distance matrix into a set of coordinates in a high dimensional space and reduces the number of dimensions to two by selecting the first two axes. Second, coordinates are mapped onto part of the \emph{CIE Lab color space} \cite {McLAREN:1976aa}, and colors are created. Third, color spaces for each column is flipped and rotated to smooth the graph and remove color noises. Finally, the average color and average hue of each sequence are calculated, sorted, and output for further display.

We will discuss all these stages step by step in this chapter.

\section{Multidimensional Scaling}

This stage begins with the similarity scores described above. Each column in the MSA has a complete set $S$ of pairwise scores. Each pair of symbols $i,j$ in this column has a score $s_{i,j}=s_{j,i} \in [0,1]$. Here 1 stands for absolute confidence or similarity, while 0 for the opposite.

This dataset can be easily converted to a set $D$ of dissimilarities or distances between two symbols, by $$d_{i,j}=d_{j,i}=1-s_{i,j}$$

These distance values are re-organized as an $n \times n$ symmetric distance matrix $M$
\[ \begin{array}{cccccc}
            & \mbox{1}  & \mbox{2}  & \mbox{3}  & \mbox{...}  & n       \\
\mbox{1}    & 0         & d_{1,2}   & d_{1,3}   & \cdots      & d_{1,n} \\
\mbox{2}    & d_{2,1}   & 0         & d_{2,3}   & \cdots      & d_{2,n} \\
\mbox{3}    & d_{3,1}   & d_{3,2}   & 0         & \cdots      & d_{3,n} \\
\cdots      & \cdots    & \cdots    & \cdots    & \cdots      & \cdots  \\
n           & d_{n,1}   & d_{n,2}   & d_{n,3}   & \cdots      & 0 \end{array} \]
where $n$ is the number of symbols in the current column.

In most cases, gap symbols are removed because they are not compared and present in the similarity data. Therefore $n$ is actually the number of non-gap symbols in the current column. In certain circumstance, however, gaps can also be treated as normal symbols, have similarity scores, and show their colors.

The distance matrix $M$ is then converted into $n$ coordinates $$C_1, C_2, ..., C_{n-1}$$$$\mbox{where } C_i=(c_{i,1}, c_{i,2}, ..., c_{i,n-1})$$ in an ($n-1$)-dimensional space. This procedure involves a statistical technique called \emph{classical multidimensional scaling (classical MDS)}, also known as \emph{Torgerson-Gower scaling} or \emph{principle coordinates analysis (PCoA)} \cite{GOWER01121966}.

Classical multidimensional scaling is designed in such a way that the coordinates have the largest variance on the first axis. In other words, $$Var(c_{1,1}, c_{2,1}, ..., c_{n,1})$$ has the largest possible value in the scaling result. The second axis has the second largest variance, and so on. That means if we pick only first $m$ axes and form a new $m$-dimensional space, the distances between coordinates are still preserved to a large extent.

These $n$ symbols are placed into a two-dimensional space, in which the distances between each other are still preserved as much as possible.  Multidimensional scaling takes a distance matrix and assigns a coordinate to each item in a ($n-1$)-dimensional space, such that the Euclidean distance between any two coordinates is approximately equal to the original distance value between the corresponding items.

An statistical function called \emph{cmdscale()} \cite{R2009aa} is utilized in this multidimensional scaling procedure. This function is written in R, which is a programming language for statistical computing \cite{Gentleman:aa}, and is provided in R’s stat package. It reads a distance matrix and returns a set of points in a $k$-dimensional space, where $k$ is a user-defined parameter to the function and must be less than the number of points $n$ \cite{Cailliez:1983aa,Cox:2008aa}.

In our approach, $k$ is set to 2, meaning the distance matrix is scaled down to a two-dimensional space. However, a column of an MSA may have as few as one or two non-gap symbols, so that the distance matrix of this column is only $1 \times 1$ or $2 \times 2$. In such cases, $k=2$ will not be a valid parameter to \emph{cmdscale()} because of the $k<n$ requirement. So we let
\[
k = \left\{ \begin{array}{ll}
0 & \mbox{if $n=1$} \\
1 & \mbox{if $n=2$} \\
2 & \mbox{otherwise}
\end{array}
\right.
\]

Then we perform the multidimensional scaling on the distance matrix $M$. When $n<2$, the generated coordinates are lower than two-dimensional, and the missing coordinate values are filled by zeros.

\section{Color Space Mapping}

The above procedure returns a set $P$ of $n$ coordinates $(x,y)$ in a two-dimensional space, representing the $n$ symbols in a column of a multiple sequence alignment. This two-dimensional space is further mapped to a three-dimensional color space.

The CIE Lab color space \cite {McLAREN:1976aa} is a color-opponent space. Compared to the \emph{RGB} and \emph{CMYK} color models, the design of Lab color emphasizes more on approximating human vision rather than physical devices. This feature makes it suitable for our visualization purpose \cite{Margulis:2005aa}.

The name \emph{Lab} stands for the three dimensions: $L$ for lightness, and $a$ and $b$ for the color opponents based on nonlinearly compressed \emph{CIE XYZ color space} coordinates \cite{CIE:1932aa,Smith:1931aa}. However, coordinates in $P$ have only two dimensions $x$ and $y$. To map $P$ to the three-dimensional CIE Lab space, $L$ for lightness is set to a fixed value 75, and two other dimensions $a$ and $b$ are assigned as $x$ and $y$ scaled from $[0,1]$ to $[0, 100]$. $$a=x*100$$ $$b=y*100$$

At the end of this step, $n$ CIE Lab colors $(L, a, b)$ are created.

\section{Hue Rotation and Optimization}\label{sec:rotation}

One of the limitations in many previous MSA visualization techniques is that they introduced unnecessary confusion by using a fixed color scheme. Even for exactly same aligned columns, colors may be totally different, which will not represent any biological dissimilarities. To minimize this effect, we convert the colors from CIE Lab space to \emph{CIE LCH space} and perform rotations to eliminate noisy color patterns to the greatest extent possible. In this way, we smooth the color pattern and emphasize the real similarities and dissimilarities among sequences and blocks.

The \emph{CIE LCH color space}, also known as polar-Lab color space, is a cylindrical transformation of the CIE Lab space so that the a and b axes are converted to a polar coordinate system. The radial coordinate $C$ measures \emph{chroma} and the angular coordinate $H$ measures \emph{hue}. LCH space makes it easier to rotate colors, by changing their \emph{hue} value. The CIE LCH colors can be converted from CIE Lab colors using an R library called \emph{colorspace} \cite{Ihaka:2009aa,Zeileis:2009aa}.

Hue values are circular, meaning $0$ and $2\pi$ are identical. Given two angles $p$ and $q$, the difference between them is given by $$\mathrm{diff}(p,q) = [(p - q + \pi) \bmod 2\pi] - \pi$$ This calculation returns the difference in the interval $[-\pi, \pi)$.

Given a set $A$ of angles $a_i$, the average angle $\overline{A}$ is $$\overline{A}=\arctan{\frac{\displaystyle\sum_{i=1}^n \sin{a_i}}{\displaystyle\sum_{i=1}^n \cos{a_i}}}$$ and the variance $\mathrm{Var}(A)$ is $$\mathrm{Var}(A)=\frac{\displaystyle\sum_{i=1}^n \mathrm{diff}(\overline{A},a_i)^2}{n}$$

Given a colored MSA $Z$ with $m$ rows and $n$ columns, to estimate its overall noise level, a penalty function $\mathrm{pen}(Z)$ is defined by $$\mathrm{pen}(Z)=\displaystyle\sum_{j=1}^m \mathrm{Var}(h_{1j},h_{2j},\cdots,h_{nj})$$ where $h_{ij}$ is the hue value of the symbol at the $i$th column and $j$th row.

We look for an optimized set, $R$, of $n$ rotation angles, $r_i$, such that after rotating each hue in every $i$th column of $Z$ clockwise by $r_i$, the rotated MSA $Z^\prime$ achieves the minimized return value of the penalty function $\mathrm{pen}(Z^\prime)$.

The optimization procedure is performed by R's general-purpose optimization function \emph{optim()} \cite{R2009aa}. Among five available optimization algorithms provided by this function, we choose the \emph{bounded Broyden-Fletcher-Goldfarb-Shanno} (L-BFGS-B) method \cite{Byrd:1995aa}. This algorithm allows box constraints, meaning each variable can be given a lower and/or upper bound. Plus, it uses a limited amount of memory. The initial values are set to 0 and boundaries to $[0, 2\pi]$. More detailed comparison between different optimization functions and algorithms will be discussed later in Chapter 6.

\section{Hue Flipping}

One characteristic of the coordinates generated from multidimensional scaling is that they can not only be rotated, but also flipped, without changing the distances between each other. Adding flipping ability to the above rotation optimization procedure may result in lower minimal penalty values and better coloring results. To keep the penalty function simple and fast, we perform a heuristic flipping \emph{before} the rotation step.

For each pair of adjacent columns $C_i$ and $C_{i+1}$, $1 \le i \le n-1$, compare their hues row by row. Rows with gaps in these two columns are not taken into consideration. For those rows with non-gap symbols on both sides, record their hue values $$h_{i,1},h_{i+1,1},h_{i,2},h_{i+1,2},\cdots,h_{i,k},h_{i+1,k}$$ into two lists $H_i$ and $H_{i+1}$: $$H_i=h_{i,1},h_{i,2},\cdots,h_{i,k}$$ $$H_{i+1}=h_{i+1,1},h_{i+1,2},\cdots,h_{i+1,k}$$

In both lists, compare two neighboring hue values $h_{i,j}$ and $h_{i,j+1}$, $1\le j\le k-1$, using diff$()$ function described in section 3.3. If diff$(h_{i,j},h_{i,j+1})<-\frac{\pi}{8}$, the change from $h_{i,j}$ to $h_{i,j+1}$ is called a \emph{clockwise change}; if diff$(h_{i,j},h_{i,j+1})>\frac{\pi}{8}$, the change is called a \emph{counterclockwise change}; otherwise, it's considered to be unchanged.

If a column contains more clockwise changes than counterclockwise changes, it is called \emph{clockwise column}; if it contains more counterclockwise changes, it is called \emph{counterclockwise column}; otherwise, it is a neutral column. Note that since gap rows are filtered out of the comparison, the type of a column may change when compared to different adjacent columns. For instance, a column may be clockwise compared to the previous column, and counterclockwise compared to the next column.

If two adjacent columns are in opposite types, flip the latter one to make them in same type. Then the flipped alignment are sent back to the procedure discussed in Section \ref{sec:rotation}.

\section{Sequence Sorting}

The alignment graph is supposed to provide information on how sequences will group with each other. We sort sequences after coloring optimization, based on the average hue of each one, which is calculated by the average angle function proposed previously in section 3.3.

The fact that hue values are circular, results in an additional question on where to cut the circle to get a sorted list of sequences. In a sorted circle of $n$ average hue values, $h_1,h_2,\cdots,h_n$, we calculate the absolute difference between each pair of adjacent ones ($h_1$ is adjacent to $h_n$). $$\Delta h_{1,2}=|\mathrm{diff}(h_1,h_2)|$$ $$\Delta h_{2,3}=|\mathrm{diff}(h_2,h_3)|$$ $$\cdots$$ $$\Delta h_{n,1}=|\mathrm{diff}(h_n,h_1)|$$ Then, the circle is split at the biggest gap $$\Delta h_{max}=\Delta h_{i,i+1}=|\mathrm{diff}(h_i,h_{i+1})|$$ to best avoid separating similar sequences.

\section{Time Complexity}

Mavis has two time-consuming procedures in its MSA coloring algorithm. The first one is the multidimensional scaling which maps the distance matrix onto a high-dimensional space. The second one is the color optimization over rotation and flipping. For an MSA with $m$ sequences and $n$ columns, we analyze the time complexity of both procedures in this section.

Algorithm \ref{alg:mds} shows the pseudocode of the multidimensional scaling procedure. This algorithm loops through $n$ columns. For each column, the number of residues is no more than $m$, which is the number of sequences, so the size of the distance matrix is no more than $m\times m$. The time complexity of the classic multidimensional scaling function with an $m\times m$ distance matrix is $O(m^3)$ \cite{Tzeng:2008qy}. Therefore, the time complexity of our scaling procedure is $O(m^3n)$.

\begin{algorithm}
\caption{Multidimensional Scaling Procedure}
\label{alg:mds}
\begin{algorithmic}
  \STATE $scores \gets$ All similarity scores
  \STATE $n \gets$ Number of columns
  \STATE $coords \gets$ Empty set of coordinates
  \FOR {$i = 1 \to n$}
    \STATE $scores[column] \gets$ Subset of $scores$ for the $i$th column
    \STATE $distances \gets$ Convert all $scores[column]$ into distances
    \STATE $matrix \gets$ Distance matrix from $distances$
    \STATE $coords[column] \gets$ Classic multidimensional scaling on $matrix$ ($O(m^3)$)
    \STATE $coords[column] \gets$ Pick first two axes from $coords[column]$
  \ENDFOR
  \RETURN $coords$
\end{algorithmic}
\end{algorithm}

In order to optimize color patterns by rotation and flipping, we define the penalty function, whose pseudocode is shown in Algorithm \ref{alg:optim}, to assess the overall color noise level. Then we run L-BFGS-B algorithm on this penalty function to search for its the global minimum value. 

The penalty function consists of two steps. First, it applies a certain amount of rotation and flipping to each residue in the alignment, which takes $O(mn)$ time. Second, it loops though all $m$ rows, calculates the hue variance of each row, and sums them up. Since the calculation of hue variance on $n$ residues takes $O(n)$, this step takes $O(mn)$ time in total. So the total time complexity of penalty function is $O(mn)$.

The time complexity of L-BFGS-B algorithm is $O(mk)$ where $k$ is a moderate number representing the limited amount of memory used in the algorithm \cite{Byrd:1995aa}. Therefore, the time complexity of the color optimization procedure is $O(m^2nk)$. Since $k<m$, it can also be considered as $O(m^3n)$.

\begin{algorithm}
\caption{Coloring Penalty Function}
\label{alg:optim}
\begin{algorithmic}
  \FOR {$i = 1 \to n$}
    \FOR {$j = 1 \to m$}
      \STATE Apply rotation and flipping to the color in the $i$th column and the $j$th row
    \ENDFOR
  \ENDFOR
  \STATE $sumvar \gets 0$
  \FOR {$j = 1 \to m$}
    \STATE $rowvar \gets$ Hue variance of the $j$th row ($O(n)$)
    \STATE $sumvar \gets sumvar + rowvar$
  \ENDFOR
  \RETURN $sumvar$
\end{algorithmic}
\end{algorithm}

As a result, we estimate that the overall time complexity of Mavis algorithm is $O(m^3n)$.
